Process Data

Extract data for 3 sites from each parquet file.

Reformat to wide dataframe.

Resample to X-minute intervals.

Add features.

Min/Max scaler.

Save to /processed/

Imports and Setup

Extract data for sites

upstream site = 447_2_351

main site = 446_1_350

downstream site = 446_3_349
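A minimal sketch of the extraction and wide-reformat steps. The site IDs come from the text; the column names (`site_id`, `timestamp`, `lane_vehicle_speed`) and the long-format layout of the parquet files are assumptions.

```python
import pandas as pd

# Site IDs from the text: upstream, main, downstream.
SITES = ["447_2_351", "446_1_350", "446_3_349"]

def extract_sites(path):
    """Read one parquet file and keep only the three sites of interest.
    Assumes a long-format file with a `site_id` column."""
    df = pd.read_parquet(path)
    return df[df["site_id"].isin(SITES)]

def to_wide(df):
    """Pivot long records (timestamp, site_id, value) to one column per site."""
    return df.pivot_table(index="timestamp", columns="site_id",
                          values="lane_vehicle_speed")
```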

Fill in Gaps
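The gap-filling step could look like the sketch below: reindex the wide frame to a regular time grid and interpolate linearly in time. The 1-minute frequency is an assumption; the original does not say how gaps are filled.

```python
import pandas as pd

def fill_gaps(wide, freq="1min"):
    """Reindex to a regular grid so missing timestamps become NaN rows,
    then interpolate linearly with respect to time."""
    full_index = pd.date_range(wide.index.min(), wide.index.max(), freq=freq)
    return wide.reindex(full_index).interpolate(method="time")
```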

Resample data

Warning: this resampling is just a groupby with respect to time. Ideally we would build a new X-minute index and do a proper interpolation with SciPy, but for now we'll go with the resampling that pandas provides.
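The groupby-style resampling described above, as pandas does it: rows are binned into X-minute intervals and averaged within each bin, with no interpolation between bins. The 5-minute bin size is an assumption.

```python
import pandas as pd

def resample_xmin(wide, minutes=5):
    """Bin rows into X-minute intervals and take the mean of each bin.
    This is a time-based groupby, not an interpolation."""
    return wide.resample(f"{minutes}min").mean()
```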

Add features

From the timestamp: cyclical time-of-day features (the sin and cos referenced in the scaler section below).
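A sketch of the timestamp-derived features, assuming they are the sin/cos pair mentioned in the Min/Max Scaler note. The column names `tod_sin`/`tod_cos` are hypothetical.

```python
import numpy as np
import pandas as pd

def add_time_features(wide):
    """Encode time of day as a point on the unit circle, so 23:59 and
    00:00 end up adjacent rather than a full day apart."""
    out = wide.copy()
    minutes = out.index.hour * 60 + out.index.minute
    angle = 2 * np.pi * minutes / (24 * 60)
    out["tod_sin"] = np.sin(angle)
    out["tod_cos"] = np.cos(angle)
    return out
```

Because these columns already lie in [-1, 1], the feature_range=(-1, 1) scaler below leaves them effectively untouched.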

SCRATCH

Min/Max Scaler

Scale all values to between -1 and 1. This way the sin and cos columns are untouched and we can use the same scaler for everything.

Fit a second scaler on just lane_vehicle_speed, because later we'll need to unscale the predictions.
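The two scaling steps above can be sketched with scikit-learn's MinMaxScaler. The `lane_vehicle_speed` column name follows the text; the helper name and data layout are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale(df, speed_col="lane_vehicle_speed"):
    """Scale every column to [-1, 1], and fit a second scaler on the
    speed column alone so predictions can be inverse-transformed later."""
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaled = pd.DataFrame(scaler.fit_transform(df),
                          index=df.index, columns=df.columns)
    speed_scaler = MinMaxScaler(feature_range=(-1, 1))
    speed_scaler.fit(df[[speed_col]])
    return scaled, scaler, speed_scaler
```

At prediction time, `speed_scaler.inverse_transform(preds)` maps model outputs back to real speeds without involving the other columns.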

Note

Typically, one should fit the scaler on the training data only, then apply that same scaler to the test data; otherwise you get information leakage (see The Elements of Statistical Learning, p. 245, "The Wrong and Right Way to Do Cross-validation"). However, with the sliding windows used for an LSTM, all of the test data eventually appears in the training data. If the first batch contains rows 0-6 for training and 7-9 for testing, the next batch contains rows 1-7 for training and 8-10 for testing: row 7 is test data in the first batch and training data in the second. Since every test row shows up in a later training window anyway, a single scaler can be applied to the entire dataset.

Export scaled data

Plot